Inferring statistically significant features from random forests

Authors

Jérôme Paul

Pierre Dupont

Abstract

Embedded feature selection can be performed by analyzing the variables used in a Random Forest. Such a multivariate selection takes into account the interactions between variables but is not straightforward to interpret in a statistical sense. We propose a statistical procedure to measure variable importance that tests whether variables are significantly useful in combination with others in a forest. We show experimentally that this new importance index correctly identifies relevant variables. The top of the variable ranking is largely correlated with Breiman's importance index based on a permutation test. Our measure has the additional benefit of producing p-values from the forest voting process. Such p-values offer a very natural way to decide which features are significantly relevant while controlling the false discovery rate. Practical experiments are conducted on synthetic and real data, including low- and high-dimensional datasets for binary or multi-class problems. Results show that the proposed technique is effective and outperforms recent alternatives by reducing the computational complexity of the selection process by an order of magnitude while keeping similar performance.
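To make the last step of the abstract concrete, the sketch below shows one common way to turn per-feature p-values into a selected set while controlling the false discovery rate. It uses the Benjamini-Hochberg step-up procedure, which is an assumption here: the abstract does not say which FDR procedure the authors use. The function name `benjamini_hochberg` and the p-values are hypothetical and purely illustrative; they are not results from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of hypotheses rejected at FDR level alpha.

    Assumption: this is the standard Benjamini-Hochberg step-up rule,
    not necessarily the procedure used in the paper.
    """
    m = len(p_values)
    # Rank p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff])


# Hypothetical per-feature p-values (illustrative numbers only).
feature_p_values = [0.001, 0.20, 0.012, 0.74, 0.03, 0.009]
selected = benjamini_hochberg(feature_p_values, alpha=0.05)
print(selected)
```

In this toy run the features at indices 0, 2, 4 and 5 would be declared significantly relevant; the other two fall above the step-up threshold.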

Similar resources

Identification of Statistically Significant Features from Random Forests

Sequential Feature Selection and Inference using Multivariate Random Forests.

Motivation: Random forest has become a widely popular prediction generating mechanism. Its strength lies in its flexibility, interpretability and ability to handle a large number of features, typically larger than the sample size. However, this methodology is of limited use if one wishes to identify statistically significant features. Several ranking schemes are available that provide information ...

Author gender identification from text using Bayesian Random Forest

Nowadays, the heavy use of virtual environments and the connections users form through social networks such as Facebook, Instagram, and Twitter make it more necessary than ever to identify shared subjects in these environments. Several applications benefit from reliable methods for inferring the age and gender of social media users. Such applications exist across a wide area of fields,...

Human Activity Recognition with Random Forests

This paper describes and analyzes an application of the random forests machine learning technique resulting in identification of human activities. The data originates from the sensors on a Samsung Galaxy S2 smart phone, and underwent significant processing to transform the 6 sensor outputs into a set of 561 features. Applying random forests required significant efforts in parameter ...

Extensions to Quantile Regression Forests for Very High-Dimensional Data

This paper describes new extensions to the state-of-the-art regression random forests Quantile Regression Forests (QRF) for applications to high dimensional data with thousands of features. We propose a new subspace sampling method that randomly samples a subset of features from two separate feature sets, one containing important features and the other one containing less important features. Th...

Journal:

Neurocomputing

Volume 150

Publication date: 2015
